Colab #4

Draft
ironmanizawesome wants to merge 15 commits into CandleLabAI:main from ironmanizawesome:colab

@ironmanizawesome

No description provided.

ironmanizawesome and others added 2 commits May 6, 2026 02:18
PCBClassNet.build() was passing the (model, learning_layer1, learning_layer2)
tuple straight into get_classification, which expects a single Keras Model.
Unpack so the classification head receives the encoder model as intended,
making the classification path actually buildable.
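
The fix described above amounts to unpacking a three-element return value before handing it to the classification head. A minimal sketch with stand-ins (`FakeModel`, `build_encoder`, and this `get_classification` are hypothetical placeholders for illustration, not the repository's code):

```python
class FakeModel:
    """Stand-in for a Keras Model; it only carries a name."""

    def __init__(self, name):
        self.name = name


def build_encoder():
    # Hypothetical builder returning (model, learning_layer1, learning_layer2),
    # mirroring the tuple shape described in the commit message.
    return FakeModel("encoder"), "learning_layer1", "learning_layer2"


def get_classification(model):
    # Hypothetical head: it expects a single model, not a tuple.
    if not isinstance(model, FakeModel):
        raise TypeError(f"expected a model, got {type(model).__name__}")
    return FakeModel(model.name + "+classifier")


def build_classifier():
    # Unpack first so that only the encoder model reaches the head.
    model, _layer1, _layer2 = build_encoder()
    return get_classification(model)
```

Passing `build_encoder()` straight into `get_classification` raises `TypeError` in this sketch, which is the failure mode the unpacking avoids.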

Also adds CLAUDE.md (project guidance) and ignores .claude/ working state
plus training log files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds notebooks/colab_train.ipynb covering the full pipeline (clone,
TF 2.10 pin, Drive mount, data unzip, seg + class training with
checkpoint backup to Drive) so an 8 GB local GPU isn't a blocker.
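
As an illustration of the checkpoint backup step, a minimal sketch of mirroring a checkpoint file into a backup directory. The helper name `mirror_checkpoint` and any Drive path passed to it are assumptions; the notebook may do this differently (for example with a shell `cp`):

```python
import shutil
from pathlib import Path


def mirror_checkpoint(src, backup_dir):
    """Copy a checkpoint file into backup_dir and return the destination path."""
    src = Path(src)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    dst = backup_dir / src.name
    shutil.copy2(src, dst)  # copy2 also preserves file timestamps
    return dst
```

In Colab, `backup_dir` would be a folder under the mounted Drive, so the copy survives a runtime reset.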

Pins TF 2.10.1 + keras 2.10 + protobuf 3.19.6 in the install cell, because
Colab's bundled TF (a Keras 3 build) breaks `tf.keras.activations.softmax`
calls and a few other patterns this codebase relies on.

notebooks/README.md captures the data zip layout, why TF 2.10, and a
VRAM cheat sheet for the common Colab GPUs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ironmanizawesome marked this pull request as draft May 5, 2026 18:37
ironmanizawesome and others added 13 commits May 6, 2026 04:03
Colab's default Python is 3.12, which has no TF 2.10 wheels available
(`pip install tensorflow==2.10.1` fails with "No matching distribution").
Insert a condacolab.install() step that swaps the kernel to a Python 3.10
base, then install the verified TF 2.10 stack on top.

The kernel auto-restarts after condacolab.install(); the cloned repo on
/content survives the restart so subsequent cells just resume.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Restructures the notebook so the entire data prep pipeline runs in Colab
from the raw FPIC archive (~7 GB) instead of requiring a pre-zipped
processed dataset (~18 GB):

- §4 unzips data_raw.zip (pcb_image + smd_annotation)
- §5 runs create_mask.py (GPU-accelerated EDSR upscaling)
- §6 runs create_patches.py (768 px patches + 80/20 train/val split)
- §§7-10 unchanged training/eval flow with section numbers shifted
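
To make the patching step concrete, a rough sketch of tiling an image into 768 px patches and holding out 20% for validation. This is not `create_patches.py`: the non-overlapping grid, the dropping of partial edge tiles, and the seeded shuffle are all assumptions:

```python
import random


def patch_origins(height, width, patch=768):
    """Top-left corners of non-overlapping patch x patch tiles that fit fully."""
    return [
        (y, x)
        for y in range(0, height - patch + 1, patch)
        for x in range(0, width - patch + 1, patch)
    ]


def split_train_val(items, val_fraction=0.2, seed=0):
    """Shuffle deterministically, then hold out val_fraction of the items."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_fraction)
    return items[n_val:], items[:n_val]
```

For example, a 1536 x 2304 image yields a 2 x 3 grid of patches, and ten items split into eight for training and two for validation.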

Caps full training at 40 epochs for both segmentation and classification.
Colab Pro caps a single session at 24 h with a 90-min idle limit and no
background execution; Seg 100 + Class 100 (~30-37 h) cannot fit.
Seg 40 + Class 40 fits comfortably in roughly 12 h on a T4.
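
The session-budget reasoning can be checked with a little arithmetic. A sketch that scales the quoted 200-epoch estimate linearly; linear scaling, same hardware for both figures, and ignoring preprocessing time are assumptions:

```python
SESSION_CAP_H = 24.0        # Colab Pro single-session cap quoted above
FULL_RUN_H = (30.0, 37.0)   # quoted estimate for Seg 100 + Class 100
FULL_RUN_EPOCHS = 200


def estimate_hours(total_epochs):
    """Scale the quoted 200-epoch estimate linearly to total_epochs."""
    low, high = FULL_RUN_H
    return (low * total_epochs / FULL_RUN_EPOCHS,
            high * total_epochs / FULL_RUN_EPOCHS)


def fits_in_session(total_epochs):
    """True when even the pessimistic estimate stays under the cap."""
    return estimate_hours(total_epochs)[1] <= SESSION_CAP_H
```

Under these assumptions, 80 total epochs (Seg 40 + Class 40) come out at about 12 to 14.8 h, in line with the "roughly 12 h" figure, while 200 epochs exceed the cap.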

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The latest condacolab defaults to Python 3.11, which TF 2.10 also has no
wheels for (only 3.7–3.10). Pass python_version="3.10" so the kernel
restart lands on a Python 3.10 base that the TF 2.10 install can match.
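
The wheel-availability problem comes down to a lookup of which Python versions each TF release ships wheels for. A small sketch using only the ranges quoted in these commit messages (the `TF_WHEELS` table and `has_wheel` helper are illustrative names):

```python
# CPython (major, minor) versions with wheels, per the ranges quoted in this thread.
TF_WHEELS = {
    "2.10": {(3, 7), (3, 8), (3, 9), (3, 10)},
    "2.15": {(3, 9), (3, 10), (3, 11)},
}


def has_wheel(tf_release, python_version):
    """True if the given TF release ships a wheel for this Python version."""
    return tuple(python_version[:2]) in TF_WHEELS[tf_release]
```

Calling `has_wheel("2.10", sys.version_info)` on a Python 3.11 or 3.12 kernel returns `False`, which corresponds to the "No matching distribution" failure from pip.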

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the condacolab Python 3.10 dance. Colab's default Python keeps
moving past TF 2.10's wheel matrix (now 3.11/3.12), and the latest
condacolab doesn't accept python_version on install_miniforge. TF 2.15
is the last TF release on Keras 2 (Keras 3 starts at TF 2.16) and
ships wheels for the Python versions Colab actually serves, so the
codebase's tf.keras.backend.{dot,transpose} usage keeps working with
no source changes.

Also rewrites the notebook from scratch to clean up duplicate cells
that crept in during incremental NotebookEdit changes (two ## 6 / ## 7
sections, both 100- and 40-epoch training cells, missing sanity cells).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plain `pip install tensorflow==2.15.0` on Colab falls back to CPU because
Colab's bundled CUDA libs are pinned to whatever TF version Colab ships,
not 2.15. The `[and-cuda]` extra pulls in matching nvidia-cudnn-cu12 /
cublas-cu12 / etc. wheels alongside TF, which is what TF's GPU loader
actually expects to dlopen.

Without this, training falls back to CPU and create_mask.py / train_*.py
take ~10× longer with periodic "Cannot dlopen some GPU libraries"
warnings in stderr.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oken)

tensorflow[and-cuda]==2.15.0 fails to resolve because the extra pins
tensorrt-libs==8.6.1, which has been removed from PyPI (only 9.x is
still available). Drop the bracket extra and install nvidia-cudnn-cu12,
nvidia-cublas-cu12, etc. by name in a separate pip call. TF needs them
at dlopen time but doesn't actually use TensorRT for training.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Colab's notebook kernel runs on Python 3.12, but TF 2.15 only ships
wheels for Python 3.9–3.11. Colab images already include
/usr/local/bin/python3.11; install the TF 2.15 stack into that
interpreter and run create_mask.py / create_patches.py /
train_*.py via !python3.11 instead of !python.

The notebook kernel itself stays on Python 3.12 — we never
import tensorflow from kernel cells, just shell-out to
python3.11 for everything that touches TF.
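
The shell-out pattern can also be written with `subprocess` instead of `!python3.11`. A minimal sketch; the helper names are hypothetical, and any script arguments are whatever the repository's scripts actually accept:

```python
import subprocess

TF_PYTHON = "python3.11"  # side interpreter that holds the TF 2.15 install


def tf_command(script, *args):
    """Build the argv for running a TF script under the side interpreter."""
    return [TF_PYTHON, script, *args]


def run_tf_script(script, *args):
    """Run the script in a child process; raises CalledProcessError on failure."""
    return subprocess.run(tf_command(script, *args), check=True)
```

Because TensorFlow is only ever imported in the child process, the Python 3.12 kernel never needs a compatible wheel.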

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…cell

DISLoss's SSIM gradient backward path allocates a 416 MB tensor
([batch=16, 26 classes, 512, 512]) that fragments the allocator on T4 16 GB
GPUs and triggers an OOM even though plenty of free memory exists. TF itself
recommends setting `TF_GPU_ALLOCATOR=cuda_malloc_async` in this case. Add it
as an environment prefix to every train/eval invocation so the setting
actually takes effect; on L4 24 GB it's redundant but harmless.
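
The 416 MB figure follows from the tensor shape if each element is a 4-byte float32 (an assumption; the commit does not state the dtype). A sketch of that arithmetic, plus one way to set the allocator variable for a child process (`train_env` is a hypothetical helper):

```python
import os


def tensor_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB; 4 bytes per element assumes float32."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element / 2**20


def train_env(base=None):
    """Copy of the environment with TF's async CUDA allocator selected."""
    env = dict(os.environ if base is None else base)
    env["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
    return env
```

`tensor_mib((16, 26, 512, 512))` gives 416.0, since 512 x 512 float32 values occupy exactly 1 MiB and there are 16 x 26 = 416 such planes. The dictionary from `train_env()` can be passed as `env=` to `subprocess.run`, which has the same effect as the shell prefix.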

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
L4 Colab images don't ship python3.11 (T4 ones do). Add a guard that
installs python3.11 from deadsnakes PPA when it's missing.
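
A sketch of what such a guard might look like: it only returns the commands to run, and the exact package list is an assumption (a real image may also need pip bootstrapped for the new interpreter):

```python
import shutil


def python311_setup_commands(which=shutil.which):
    """Return the commands needed to provide python3.11, or [] if it exists."""
    if which("python3.11"):
        return []  # interpreter already on PATH, nothing to install
    return [
        ["add-apt-repository", "-y", "ppa:deadsnakes/ppa"],
        ["apt-get", "update"],
        ["apt-get", "install", "-y", "python3.11"],
    ]
```

The `which` parameter exists only so the check can be exercised without touching the real system.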

Pin every nvidia-*-cu12 wheel to the version TF 2.15 expects to dlopen:
- nvidia-cudnn-cu12==8.9.4.25 (latest is 9.x; TF 2.15 needs libcudnn.so.8)
- nvidia-cublas-cu12==12.2.5.6 etc.

Without these pins TF 2.15 falls back to CPU on a fresh runtime because
it can't find the right .so versions, and the warnings are easy to miss.
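
For illustration, the pinned install can be expressed as a single pip invocation against the side interpreter. Only the two versions spelled out above are included; the wheels covered by "etc." are not listed in the commit, so they are left out rather than guessed. Invoking pip as `python3.11 -m pip` is an assumption:

```python
# Only the pins named in the commit message; further nvidia-*-cu12 wheels
# are needed in practice but their versions are not given here.
PINNED_WHEELS = {
    "nvidia-cudnn-cu12": "8.9.4.25",
    "nvidia-cublas-cu12": "12.2.5.6",
}


def pip_install_command(python="python3.11", pins=PINNED_WHEELS):
    """Build a pip argv that installs every wheel at its pinned version."""
    requirements = [f"{name}=={version}" for name, version in pins.items()]
    return [python, "-m", "pip", "install", *requirements]
```

Keeping the pins in one mapping makes it easy to see at a glance which `.so` versions the TF build is expected to load.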

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A 40-epoch cap was too aggressive a cut from the paper's 100. 80 is a
better trade-off: enough room for ReduceLROnPlateau (patience=15) to fire
and fine-tune, while still fitting inside Colab Pro's 24 h session limit
(~9 h per model on L4 = ~18 h total, plus a preprocessing buffer).
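
To see why the epoch budget matters with patience=15, here is a simplified counter modeled on the patience behaviour of Keras's ReduceLROnPlateau with cooldown=0. It assumes the worst case, where the monitored metric never improves after the first epoch, and it ignores `min_lr`:

```python
def plateau_reductions(epochs, patience=15):
    """Epochs (1-indexed) at which the LR would drop on a fully flat metric."""
    reductions = []
    wait = 0
    for epoch in range(2, epochs + 1):  # epoch 1 sets the initial best value
        wait += 1                        # another epoch without improvement
        if wait >= patience:
            reductions.append(epoch)     # LR is reduced at the end of this epoch
            wait = 0
    return reductions
```

Even in this worst case a 40-epoch run allows only two reductions (after epochs 16 and 31), while 80 epochs allow five. In a real run the metric improves for a while first, so a stall starting at epoch e cannot trigger a drop before epoch e + 15.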

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The first 80-epoch segmentation run lands val_dice around 0.71 with train
dice 0.92 (clear overfit) and lr already at min_lr=1e-5. Add a second
optional stage that resumes from best_seg.h5 with a lower lr range so
ReduceLROnPlateau can keep stepping down past 1e-5.

Changes:
- train_segmentation.py: -resume CLI flag; when set, model.load_weights
  is called on the configured checkpoint path before fit().
- src/cfs/pscn_seg_finetune.yml: same architecture as pscn_seg.yml but
  lr=1e-5 (where the first run left off) and min_lr=1e-6.
- notebooks/colab_train.ipynb: new §8b that restores best_seg.h5 from
  Drive if missing, runs 20 epochs with -resume + the finetune config,
  then re-mirrors the best checkpoint.
- .gitignore: ignore /best_*.h5 and root-level *.zip (Colab artifacts
  that landed in the working tree).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Brings the fine-tune second stage into the main Colab branch so users
running the notebook always have §8b available without switching
branches.
TF 2.15 lazy-loads tf.keras, and accessing __version__ on it raises
AttributeError mid-cell, swallowing the GPU print that follows. Print
TF version + GPU list only; users who specifically need the keras
version can run it in a separate cell.
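
For anyone who does want the keras version in the same cell, a defensive lookup avoids the mid-cell failure. A sketch with a hypothetical helper, `safe_version`:

```python
def safe_version(module):
    """Return module.__version__, or 'unknown' if the attribute lookup fails."""
    try:
        return module.__version__
    except AttributeError:
        return "unknown"
```

Printing `safe_version(tf.keras)` after the TF version and GPU list would then degrade to "unknown" instead of raising and hiding the GPU output.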

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>